Modeling the Contribution of Central Versus Peripheral Vision in Scene, Object, and Face Recognition
It is commonly believed that the central visual field is important for
recognizing objects and faces, and the peripheral region is useful for scene
recognition. However, the relative importance of central versus peripheral
information for object, scene, and face recognition is unclear. In a behavioral
study, Larson and Loschky (2009) investigated this question by measuring scene
recognition accuracy as a function of visual angle, and demonstrated that
peripheral vision was indeed more useful in recognizing scenes than central
vision. In this work, we modeled and replicated the result of Larson and
Loschky (2009), using deep convolutional neural networks. Having fit the data
for scenes, we used the model to predict future data for large-scale scene
recognition as well as for objects and faces. Our results suggest that the
relative order of importance of central visual field information is face
recognition > object recognition > scene recognition, and vice versa for
peripheral information. Comment: CogSci 2016 Conference Paper
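As a rough illustration of the masking paradigm described above, the sketch below builds "window" (central-only) and "scotoma" (peripheral-only) stimuli and runs them through a pretrained CNN. The mask radius, grey fill, ResNet-50 backbone, and the input file name are illustrative assumptions, not the authors' exact stimuli or model.

```python
# Sketch: "window" (central-only) vs "scotoma" (peripheral-only) stimuli fed to a
# pretrained CNN. Mask radius, grey fill, ResNet-50 backbone, and the input file
# are illustrative assumptions, not the authors' exact setup.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def radial_mask(size, radius_frac, keep_center=True):
    """Boolean mask keeping the central disk (window) or its complement (scotoma)."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    c = (size - 1) / 2.0
    dist = torch.sqrt((ys - c) ** 2 + (xs - c) ** 2)
    inside = dist <= radius_frac * size / 2.0
    return inside if keep_center else ~inside

def mask_image(img, radius_frac, keep_center, fill=0.5):
    """Replace the discarded region with a neutral grey value."""
    mask = radial_mask(img.shape[-1], radius_frac, keep_center)
    out = img.clone()
    out[:, ~mask] = fill
    return out

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
model = models.resnet50(weights="IMAGENET1K_V2").eval()          # stand-in classifier

img = preprocess(Image.open("scene.jpg"))                        # hypothetical input image
window = mask_image(img, radius_frac=0.5, keep_center=True)      # central vision only
scotoma = mask_image(img, radius_frac=0.5, keep_center=False)    # peripheral vision only

with torch.no_grad():
    for name, x in [("window", window), ("scotoma", scotoma)]:
        pred = model(x.unsqueeze(0)).argmax(dim=1).item()
        print(name, pred)
```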
Are Face and Object Recognition Independent? A Neurocomputational Modeling Exploration
Are face and object recognition abilities independent? Although it is
commonly believed that they are, Gauthier et al. (2014) recently showed that
these abilities become more correlated as experience with nonface categories
increases. They argued that there is a single underlying visual ability, v,
that is expressed in performance with both face and nonface categories as
experience grows. Using the Cambridge Face Memory Test and the Vanderbilt
Expertise Test, they showed that the shared variance between Cambridge Face
Memory Test and Vanderbilt Expertise Test performance increases monotonically
as experience increases. Here, we address why a shared resource across
different visual domains does not lead to competition and an inverse
correlation in abilities. We explain this conundrum using our
neurocomputational model of face and object processing (The Model, TM). Our
results show that, as in the behavioral data, the correlation between
subordinate level face and object recognition accuracy increases as experience
grows. We suggest that different domains do not compete for resources because
the relevant features are shared between faces and objects. The essential power
of experience is to generate a "spreading transform" for faces that generalizes
to objects that must be individuated. Interestingly, when the task of the
network is basic level categorization, no increase in the correlation between
domains is observed. Hence, our model predicts that it is the type of
experience that matters and that the source of the correlation is in the
fusiform face area, rather than in cortical areas that subserve basic level
categorization. This result is consistent with our previous modeling
elucidating why the FFA is recruited for novel domains of expertise (Tong et
al., 2008).
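The statistical signature described above can be illustrated with a small simulation: if face and nonface performance both express a shared ability v, but nonface performance expresses it only in proportion to experience, the face-object correlation rises as experience grows. The distributions and coefficients below are illustrative assumptions, not parameters of the authors' neurocomputational model (TM).

```python
# Sketch: correlation between face and nonface recognition rises with experience
# when both express a shared ability v. Distributions and coefficients are
# illustrative assumptions, not parameters of the authors' model (TM).
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 10_000

v = rng.normal(0.0, 1.0, n_subjects)                  # shared visual ability
face_score = v + rng.normal(0.0, 0.5, n_subjects)     # face expertise already developed

for experience in [0.0, 0.25, 0.5, 0.75, 1.0]:
    # Nonface performance expresses v only in proportion to experience; the rest
    # is category-specific noise, so the correlation grows as experience does.
    obj_score = (experience * v
                 + (1.0 - experience) * rng.normal(0.0, 1.0, n_subjects)
                 + rng.normal(0.0, 0.5, n_subjects))
    r = np.corrcoef(face_score, obj_score)[0, 1]
    print(f"experience={experience:.2f}  corr(face, object)={r:.2f}")
```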
Understanding Convolution for Semantic Segmentation
Recent advances in deep learning, especially deep convolutional neural
networks (CNNs), have led to significant improvement over previous semantic
segmentation systems. Here we show how to improve pixel-wise semantic
segmentation by manipulating convolution-related operations that are of both
theoretical and practical value. First, we design dense upsampling convolution
(DUC) to generate pixel-level prediction, which is able to capture and decode
more detailed information that is generally missing in bilinear upsampling.
Second, we propose a hybrid dilated convolution (HDC) framework in the encoding
phase. This framework 1) effectively enlarges the receptive fields (RF) of the
network to aggregate global information; 2) alleviates what we call the
"gridding issue" caused by the standard dilated convolution operation. We
evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a
state-of-the-art result of 80.1% mIoU on the test set at the time of submission. We
also have achieved state-of-the-art overall on the KITTI road estimation
benchmark and the PASCAL VOC2012 segmentation task. Our source code can be
found at https://github.com/TuSimple/TuSimple-DUC. Comment: WACV 2018. Updated acknowledgements. Source code:
https://github.com/TuSimple/TuSimple-DUC
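The two operations named in the abstract are straightforward to sketch in PyTorch. Channel sizes, the upsampling factor r, and layer counts below are illustrative assumptions; the released TuSimple-DUC code is the authoritative implementation.

```python
# Sketch of dense upsampling convolution (DUC) and hybrid dilated convolution (HDC).
# Channel sizes, the factor r, and layer counts are illustrative assumptions.
import torch
import torch.nn as nn

class DUC(nn.Module):
    """Dense Upsampling Convolution: predict r*r sub-pixel labels per low-resolution
    cell, then rearrange them into a full-resolution prediction with PixelShuffle."""
    def __init__(self, in_channels, num_classes, r):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_classes * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.conv(x))        # (N, num_classes, H*r, W*r)

class HDCBlock(nn.Module):
    """Hybrid Dilated Convolution: consecutive dilation rates without a common
    factor (e.g. 1, 2, 3), so the effective receptive field has no gridding holes."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        return self.layers(x)

feat = torch.randn(1, 512, 64, 128)              # backbone features at 1/8 resolution
logits = DUC(512, num_classes=19, r=8)(HDCBlock(512)(feat))
print(logits.shape)                              # torch.Size([1, 19, 512, 1024])
```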
HPLFlowNet: Hierarchical Permutohedral Lattice FlowNet for Scene Flow Estimation on Large-scale Point Clouds
We present a novel deep neural network architecture for end-to-end scene flow
estimation that directly operates on large-scale 3D point clouds. Inspired by
Bilateral Convolutional Layers (BCL), we propose novel DownBCL, UpBCL, and
CorrBCL operations that restore structural information from unstructured point
clouds, and fuse information from two consecutive point clouds. Operating on
discrete and sparse permutohedral lattice points, our architectural design is
parsimonious in computational cost. Our model can efficiently process a pair of
point cloud frames at once with a maximum of 86K points per frame. Our approach
achieves state-of-the-art performance on the FlyingThings3D and KITTI Scene
Flow 2015 datasets. Moreover, trained on synthetic data, our approach shows
great generalization ability on real-world data and on different point
densities without fine-tuning.
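The BCL-style layers rest on a splat / aggregate / slice pattern: point features are scattered onto lattice points, processed there, and read back out at the original points. The sketch below illustrates that pattern on a plain voxel grid as a much-simplified stand-in for the permutohedral lattice; the voxel size and feature widths are illustrative assumptions, and the paper's DownBCL, UpBCL, and CorrBCL differ in detail.

```python
# Much-simplified sketch of the splat / aggregate / slice pattern behind BCL-style
# layers, using a regular voxel grid as a stand-in for the permutohedral lattice.
# Voxel size and feature widths are illustrative assumptions.
import torch

def splat_aggregate_slice(points, feats, voxel_size=0.2):
    """Scatter per-point features into voxels, average per voxel, and read the
    pooled feature back out at every point ("slice")."""
    coords = torch.floor(points / voxel_size).long()              # (N, 3) voxel indices
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    pooled = torch.zeros(uniq.shape[0], feats.shape[1])
    counts = torch.zeros(uniq.shape[0], 1)
    pooled.index_add_(0, inverse, feats)                          # splat: sum features per voxel
    counts.index_add_(0, inverse, torch.ones(feats.shape[0], 1))
    pooled = pooled / counts.clamp(min=1.0)                       # aggregate: mean per voxel
    return pooled[inverse]                                        # slice: back onto the points

# Toy frame with random features; a CorrBCL-like step would additionally fuse
# lattice features coming from two consecutive frames.
pts1, feats1 = torch.rand(8192, 3) * 10.0, torch.rand(8192, 16)
print(splat_aggregate_slice(pts1, feats1).shape)                  # torch.Size([8192, 16])
```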
LidarMultiNet: Towards a Unified Multi-task Network for LiDAR Perception
LiDAR-based 3D object detection, semantic segmentation, and panoptic
segmentation are usually implemented in specialized networks with distinctive
architectures that are difficult to adapt to each other. This paper presents
LidarMultiNet, a LiDAR-based multi-task network that unifies these three major
LiDAR perception tasks. Among its many benefits, a multi-task network can
reduce the overall cost by sharing weights and computation among multiple
tasks. However, it typically underperforms compared to independently combined
single-task models. The proposed LidarMultiNet aims to bridge the performance
gap between the multi-task network and multiple single-task networks. At the
core of LidarMultiNet is a strong 3D voxel-based encoder-decoder architecture
with a Global Context Pooling (GCP) module extracting global contextual
features from a LiDAR frame. Task-specific heads are added on top of the
network to perform the three LiDAR perception tasks. More tasks can be
implemented simply by adding new task-specific heads while introducing little
additional cost. A second stage is also proposed to refine the first-stage
segmentation and generate accurate panoptic segmentation results. LidarMultiNet
is extensively tested on both Waymo Open Dataset and nuScenes dataset,
demonstrating for the first time that major LiDAR perception tasks can be
unified in a single strong network that is trained end-to-end and achieves
state-of-the-art performance. Notably, LidarMultiNet reaches the official 1st
place in the Waymo Open Dataset 3D semantic segmentation challenge 2022 with
the highest mIoU and the best accuracy for most of the 22 classes on the test
set, using only LiDAR points as input. It also sets the new state-of-the-art
for a single model on the Waymo 3D object detection benchmark and three
nuScenes benchmarks. Comment: Full-length paper extending our previous technical report of the 1st
place solution of the 2022 Waymo Open Dataset 3D Semantic Segmentation
challenge, including evaluations on 5 major benchmarks. arXiv admin note:
text overlap with arXiv:2206.1142
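The multi-task pattern described above (one shared encoder, a global-context step, and lightweight task-specific heads) can be sketched as follows. A dense 2D BEV backbone stands in for the paper's sparse 3D voxel encoder-decoder, and the layer sizes and head outputs are illustrative assumptions, not the LidarMultiNet architecture.

```python
# Minimal multi-task sketch: shared encoder, global-context pooling, per-task heads.
# A dense 2D BEV backbone stands in for the paper's sparse 3D voxel encoder-decoder;
# all layer sizes and head outputs are illustrative assumptions.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, in_ch=32, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        feat = self.net(x)
        # Global-context step: pool the whole frame and broadcast it back, so every
        # location sees frame-level context (a simplified stand-in for GCP).
        ctx = feat.mean(dim=(2, 3), keepdim=True).expand_as(feat)
        return torch.cat([feat, ctx], dim=1)

class LidarMultiTaskNet(nn.Module):
    def __init__(self, num_classes=22, num_det_outputs=7):
        super().__init__()
        self.encoder = SharedEncoder()
        self.seg_head = nn.Conv2d(128, num_classes, 1)        # semantic segmentation head
        self.det_head = nn.Conv2d(128, num_det_outputs, 1)    # per-cell box regression head

    def forward(self, bev):
        feat = self.encoder(bev)                              # shared weights and computation
        return {"seg": self.seg_head(feat), "det": self.det_head(feat)}

out = LidarMultiTaskNet()(torch.randn(1, 32, 256, 256))
print(out["seg"].shape, out["det"].shape)   # [1, 22, 128, 128] [1, 7, 128, 128]
```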
LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception
There is a recent trend in the LiDAR perception field towards unifying
multiple tasks in a single strong network with improved performance, as opposed
to using separate networks for each task. In this paper, we introduce a new
LiDAR multi-task learning paradigm based on the transformer. The proposed
LiDARFormer utilizes cross-space global contextual feature information and
exploits cross-task synergy to boost the performance of LiDAR perception tasks
across multiple large-scale datasets and benchmarks. Our novel
transformer-based framework includes a cross-space transformer module that
learns attentive features between the 2D dense Bird's Eye View (BEV) and 3D
sparse voxel feature maps. Additionally, we propose a transformer decoder for
the segmentation task to dynamically adjust the learned features by leveraging
the categorical feature representations. Furthermore, we combine the
segmentation and detection features in a shared transformer decoder with
cross-task attention layers to enhance and integrate the object-level and
class-level features. LiDARFormer is evaluated on the large-scale nuScenes and
the Waymo Open datasets for both 3D detection and semantic segmentation tasks,
and it outperforms all previously published methods on both tasks. Notably,
LiDARFormer achieves state-of-the-art performance of 76.4% L2 mAPH and
74.3% NDS on the challenging Waymo and nuScenes detection benchmarks for a
single-model, LiDAR-only method. Comment: ICRA 202
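The cross-space idea of attending between dense BEV features and sparse voxel features can be sketched with standard attention modules. The token counts, feature dimension, and head count below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of cross-space attention: sparse voxel feature tokens attend to a
# flattened dense BEV feature map (the reverse direction works the same way).
# Token counts, dimension, and head count are illustrative assumptions.
import torch
import torch.nn as nn

class CrossSpaceAttention(nn.Module):
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_tokens, bev_map):
        # Flatten the dense BEV map (N, C, H, W) into a token sequence (N, H*W, C).
        bev_tokens = bev_map.flatten(2).transpose(1, 2)
        attended, _ = self.attn(query=voxel_tokens, key=bev_tokens, value=bev_tokens)
        return self.norm(voxel_tokens + attended)    # residual update of the voxel tokens

voxel_tokens = torch.randn(1, 2048, 128)   # sparse 3D voxel features as tokens
bev_map = torch.randn(1, 128, 64, 64)      # dense 2D BEV feature map
print(CrossSpaceAttention()(voxel_tokens, bev_map).shape)   # torch.Size([1, 2048, 128])
```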